Add check_aws_events health_checks plugin (IMDSv2 maintenance event poll) (#139)

Merged
meta-codesync[bot] merged 1 commit into facebookresearch:main from kenerwin88:export-D104060213
May 7, 2026

Conversation

kenerwin88 commented May 6, 2026 (Contributor)

Summary:

Adds `check-aws-events`, a new GCM `health_checks` Click subcommand that polls EC2 IMDSv2 (`/latest/meta-data/events/maintenance/scheduled`) for pending instance maintenance / retirement events scheduled against the local node. Surfaces them as a node condition via NPD's exit-code translation so operators can drain / cordon / replace the instance ahead of AWS's enforced `NotBefore` rather than letting workloads be killed when AWS rotates the host.

Endpoint, method, and headers match the AWS IMDSv2 spec used by `aws-node-termination-handler` for the scheduled-events endpoint (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-instances-status-check_sched.html#viewing_scheduled_events). Conservative fail-safe semantics: any transport / unreachable / non-list / non-dict / malformed-payload response returns `ExitCode.OK` so a transient IMDS blip can never trigger a fleet-wide drain. Only a `200 + non-empty events array` exits `WARN` with a one-line summary (`Code NotBefore=... State=... EventId=...`), which NPD records as the condition message.
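
The fail-safe mapping described above could look roughly like the sketch below. It is not the code from this PR: `classify`, `summarize_event`, the local `ExitCode` enum and its numeric values, the `unknown` fallback, and the `"; "` separator for multiple events are all assumptions made for illustration. Only the field names (`Code`, `NotBefore`, `State`, `EventId`) and the OK/WARN rule come from the description.

```python
import json
from enum import IntEnum
from typing import Optional, Tuple


class ExitCode(IntEnum):
    # Hypothetical stand-in for GCM's ExitCode; the numeric values are assumed.
    OK = 0
    WARN = 1


def summarize_event(event: dict) -> str:
    """Render one event as 'Code NotBefore=... State=... EventId=...'."""
    return "{} NotBefore={} State={} EventId={}".format(
        event.get("Code", "unknown"),
        event.get("NotBefore", "unknown"),
        event.get("State", "unknown"),
        event.get("EventId", "unknown"),
    )


def classify(status_code: Optional[int], body: Optional[str]) -> Tuple[ExitCode, str]:
    """Map an IMDS response to (exit code, message).

    status_code=None models a transport failure (IMDS unreachable).
    Anything other than a 200 with a non-empty list of dicts is treated as OK.
    """
    if status_code != 200 or body is None:
        return ExitCode.OK, "no scheduled events (IMDS unreachable or non-200)"
    try:
        payload = json.loads(body)
    except ValueError:
        return ExitCode.OK, "no scheduled events (malformed payload)"
    if not isinstance(payload, list) or not payload:
        return ExitCode.OK, "no scheduled events"
    # Assumption: a single non-dict entry makes the whole payload malformed.
    if not all(isinstance(entry, dict) for entry in payload):
        return ExitCode.OK, "no scheduled events (malformed payload)"
    return ExitCode.WARN, "; ".join(summarize_event(entry) for entry in payload)
```

Under these assumptions, only the last `return` can produce `WARN`, so every failure mode falls through to `OK` by construction.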

Files:

  • `gcm/health_checks/checks/check_aws_events.py` (new): Click subcommand + two helpers (`fetch_imds_token`, `fetch_scheduled_events`) that the test suite injects fakes for via `click.pass_obj`. Always passes `proxies={"http": "", "https": ""}` so `requests` never honors `HTTP_PROXY` and accidentally routes the IMDS call through a proxy, which would reach the proxy server's metadata instead of the local node's.
  • `gcm/health_checks/checks/__init__.py` + `gcm/health_checks/cli/health_checks.py`: register the new subcommand.
  • `gcm/schemas/health_check/health_check_name.py`: new `HealthCheckName.CHECK_AWS_EVENTS = "check aws events"` for telemetry.
  • `gcm/tests/health_checks_tests/test_check_aws_events.py` (new): 19 tests covering token fetch (happy / off-EC2 / 5xx / empty body / proxies bypass / trailing-slash), events response (200-empty / 404 / one-pending / multi-pending / non-list / non-dict / unreachable / 5xx / garbage / proxies bypass / trailing-slash), and full Click command exit codes (off-EC2 → OK, pending → WARN with summary).
  • `BUCK`: add `requests` to `:health_checks` library deps and `requests` + `requests-mock` to `:health_checks_pytest`.
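
For readers unfamiliar with the IMDSv2 flow the two helpers wrap, here is a minimal stdlib-only sketch. It deliberately swaps `requests` for `urllib`, so the proxy bypass is done with an empty `ProxyHandler` rather than the `proxies=` argument used in the PR. The function signatures, the 60-second token TTL, the 2-second timeout, and the `build_*_request` helpers are assumptions for illustration. The endpoint path and the two `X-aws-ec2-metadata-token*` headers are the standard IMDSv2 ones.

```python
import http.client
import json
import urllib.request
from typing import Optional

IMDS_BASE = "http://169.254.169.254"  # link-local IMDS address
TOKEN_PATH = "/latest/api/token"
EVENTS_PATH = "/latest/meta-data/events/maintenance/scheduled"
_ERRORS = (OSError, ValueError, http.client.HTTPException)


def _no_proxy_opener() -> urllib.request.OpenerDirector:
    # An empty mapping disables proxy autodetection (HTTP_PROXY and friends).
    return urllib.request.build_opener(urllib.request.ProxyHandler({}))


def build_token_request(base_url: str = IMDS_BASE, ttl_seconds: int = 60) -> urllib.request.Request:
    # IMDSv2 session tokens are requested with PUT plus a TTL header.
    return urllib.request.Request(
        base_url.rstrip("/") + TOKEN_PATH,  # tolerate a trailing slash
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def build_events_request(token: str, base_url: str = IMDS_BASE) -> urllib.request.Request:
    # Metadata reads are plain GETs that carry the session token.
    return urllib.request.Request(
        base_url.rstrip("/") + EVENTS_PATH,
        headers={"X-aws-ec2-metadata-token": token},
    )


def fetch_imds_token(base_url: str = IMDS_BASE, timeout: float = 2.0) -> Optional[str]:
    """Return a session token, or None off-EC2 / on any error / on an empty body."""
    try:
        with _no_proxy_opener().open(build_token_request(base_url), timeout=timeout) as resp:
            token = resp.read().decode("utf-8").strip()
    except _ERRORS:
        return None
    return token or None


def fetch_scheduled_events(token: str, base_url: str = IMDS_BASE, timeout: float = 2.0) -> list:
    """Return the scheduled-events list, or [] on any error or non-list payload."""
    try:
        with _no_proxy_opener().open(build_events_request(token, base_url), timeout=timeout) as resp:
            payload = json.loads(resp.read().decode("utf-8"))
    except _ERRORS:
        return []
    return payload if isinstance(payload, list) else []
```

Splitting request construction from the network call keeps the URL and header logic checkable without touching IMDS, which is roughly the role the injected fakes play in the PR's test suite.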

Differential Revision: D104060213

meta-codesync Bot commented May 6, 2026

@kenerwin88 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D104060213.

github-actions Bot commented May 6, 2026

CI Commands

The following CI workflows run automatically on every push and pull request:

| Workflow | What it runs |
| --- | --- |
| GPU Cluster Monitoring Python CI | lint, tests, typecheck, format, deb build, pyoxidizer builds |
| Go packages CI | shelper tests, format, lint |

The following commands can be used by maintainers to trigger additional tests that require access to secrets:

| Command | Description | Requires approval? |
| --- | --- | --- |
| /metaci tests | Runs Meta internal integration tests (pytest) | Yes: a maintainer must trigger the command and approve the deployment request |
| /metaci integration tests | Same as above (alias) | Yes |

Note: Only repository maintainers (OWNER association) can trigger /metaci commands. After commenting the command, a maintainer must also navigate to the Actions tab and approve the deployment to the graph-api-access environment before the jobs will run. See the approval guidelines for what to approve or reject.

luccabb commented May 6, 2026 (Contributor)

@kenerwin88 thanks for the contribution! Overall LGTM, just 2 things:

  1. killswitches support: https://facebookresearch.github.io/gcm/docs/GCM_Health_Checks/adding_new_health_check/#6-add-the-killswitch-feature-flag
  2. add docs: https://facebookresearch.github.io/gcm/docs/GCM_Health_Checks/adding_new_health_check/#8-add-website-documentation

luccabb left a comment (Contributor)

to add the above

meta-codesync Bot changed the title from "Add check_aws_events health_checks plugin (IMDSv2 maintenance event poll)" to "Add check_aws_events health_checks plugin (IMDSv2 maintenance event poll) (#139)" on May 7, 2026
kenerwin88 force-pushed the export-D104060213 branch from faab1b9 to 454af8a on May 7, 2026 18:12
kenerwin88 requested a review from luccabb on May 7, 2026 18:32
kenerwin88 commented (Contributor, Author)

Added killswitch and docs :). Ty @luccabb !

luccabb left a comment (Contributor)

lgtm!

@meta-codesync meta-codesync Bot merged commit c2396d3 into facebookresearch:main May 7, 2026
21 of 23 checks passed
@kenerwin88 kenerwin88 deleted the export-D104060213 branch May 8, 2026 16:21
2 participants